# We may need to install tidyverse and ggrepel if they aren't already installed
# If so, run this code:
install.packages("tidyverse")
install.packages("ggrepel")An introduction to plotting with ggplot2
With real biological data
Why use ggplot2?
In this exercise, you’ll use data based on a real study rather than one of R’s inbuilt datasets. This gives you experience working with real biological data and shows you how to process such data to create your own visualisations using ggplot2(Wickham 2009).
The data comes from a study that identified transcriptomic signatures of tuberculosis (TB) infection(Singhania et al. 2018). The researchers used RNAseq to profile transcriptomes from patients with active TB, latent TB, those who later developed TB (TB progressors), and control patients.
For this practical, you’ll compare these groups and produce volcano plots and box plots/violin plots to summarise and interpret the results.
What you will need for this practical
To have R and Posit RStudio ready. We will work in RStudio.
To create a directory for this practical and save it somewhere easy to locate. You should call it ‘data_visualisation’. We suggest that you create this directory inside your Documents folder or on your Desktop.
To Download the Practical materials and save them inside a folder called ‘data’, inside your ‘data_visualisation’ directory. There are multiple csv files: ‘active_TB_vs_control_results.csv’, ‘TB_progressor_vs_control_results.csv’, ‘normalised_TB_data.csv’, ‘TB_metadata.csv’, within the Practical page in Canvas. You should make sure these are all inside the ‘data’ folder.
To change your working directory to the ‘data_visualisation’ folder.
To open a new .R script and save it in the ‘data_visualisation’ folder. You will continue working on this script as you proceed through the practical, using the example code we provide. I’d strongly recommend you type this example code rather than copying and pasting. You will make more mistakes but those mistakes will aid in your learning and retention!
Worked examples
Volcano plot
A volcano plot is a specific type of scatter plot that shows statistical significance (-log10 adjusted p-values) versus magnitude of change (log fold change). They are common in –omics analyses, where they help identify changes between two groups of subjects, e.g., treatment vs control. Each point represents a gene, protein, metabolite or similar.
We’ll start by reading in data (comparing patients with active TB and controls) and loading the R packages we need.
# Load libraries
library(tidyverse) # Loads ggplot2, magrittr, and dplyr
library(ggrepel) # For better text labels on plots# Remember that we use the <- to assign values to variables
# We use the # to add comments to our code, you should also use comments to help you remember why you did certain things
# Read in the data
activeTB_vs_control <- read.csv("data/active_TB_vs_control_results.csv", row.names=1)Before creating any plots, let’s quickly inspect the data:
# View dimensions of the data
dim(activeTB_vs_control)[1] 17786 6
# Look at the structure of the data
head(activeTB_vs_control, 10) logFC AveExpr t P.Value adj.P.Val B
FCGR1A 3.301767 6.149391 12.02848 1.223836e-21 2.176714e-17 38.65068
SERPING1 3.825079 6.433310 11.62863 9.709342e-21 8.634518e-17 36.60604
UBE2L6 1.801724 8.197470 11.54765 1.478635e-20 8.766333e-17 36.21297
GBP4 2.025176 6.142948 11.33775 4.406000e-20 1.656056e-16 35.14543
GBP5 2.653025 8.099339 11.32718 4.655505e-20 1.656056e-16 35.09147
WARS 1.992432 8.349681 11.19540 9.250438e-20 2.742138e-16 34.41573
BATF2 3.936097 5.022033 11.08332 1.659741e-19 4.217165e-16 33.74777
GBP2 1.521199 9.124542 11.04558 2.021065e-19 4.493333e-16 33.64526
APOL1 1.949465 6.590448 10.97231 2.962881e-19 5.372448e-16 33.27924
SCO2 2.230601 7.101993 10.96861 3.020605e-19 5.372448e-16 33.26115
You should see 17,786 rows (genes) and six columns (variables). Here’s what each column represents:
| Column name | Description |
|---|---|
| logFC | The log fold-change column, it is an estimate of the log2-fold-change of the mean expression between the comparisons of interest |
| Avexpr | The average expression of the gene across all samples (log2 scale) |
| t | Moderated t-statistic i.e. the ratio of log2FC to its estimated standard error |
| P.value | The raw p-values |
| adj.P.Val | The p-values adjusted for multiple/testing false discovery rate (FDR). The default Benjamini-Hochberg method was used to adjust the p-values for FDR here. |
| B | Log-odds that the gene is differentially expressed, assuming a certain fixed percentage of differentially expressed genes. Generally, the higher the value, the greater the chance the gene is differentially expressed |
Always use adjusted p-values to account for multiple testing.
Building the volcano plot step by step
Let’s start building our volcano plot using ggplot2. First, we’ll add a new column to our data, which transforms the adjusted p-values into the -log10 scale (this is for visualisation purposes):
# Add log10-adjusted p-value column
activeTB_vs_control$log10_p.adjust <- -log10(activeTB_vs_control$adj.P.Val)
# Confirm the new column is added
dim(activeTB_vs_control)[1] 17786 7
Now that we are familiar with the data and have made the adjustments we need, we can begin to construct our plot using ggplot2. As mentionedt in the lecture, ggplot2 uses the ‘grammar of graphics’ (shown below).
Let’s build the plot layer by layer:
Start with the data
First, let’s see what happens if we just provide the ggplot function with our data.
# Create a blank ggplot object with the data
p <- ggplot(data = activeTB_vs_control)
# Display the empty canvas
pThis results in just a blank canvas. All we have done is store our data inside our ggplot2 object, and provided no additional information on the aesthetics or geometries.
Add axis mappings
Next, let’s provide ggplot2 with our x and y axes using the aes function inside our ggplot function…
# Map the logFC to x and log10_p.adjust to y
p <- ggplot(data = activeTB_vs_control, aes(x=logFC, y=log10_p.adjust))
# Display the grid
pNow, instead of a blank canvas we get a grid with our x and y axes shown.
It is common for new ggplot2 users to miss off the closing brackets when providing the aesthetics inside the ggplot function, if you are getting errors make sure this is not the case!
Add points (geometries)
Next, we are going to tell ggplot which geometries to use. This is one of the aspects that makes ggplot so versatile but can also be one of the most difficult aspects to grasp.
Here we are telling ggplot2 whether to display our data as lines, bars, points, etc. We do so by adding a ‘+’ sign after our initial ggplot() call, which is adding our extra layer.
# Add points to the plot
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) +
geom_point()
pNow, we have produced a (very basic!) volcano plot. Each point represents an individual gene plotted by their logFC and -log10 adjusted p-values. What observations can you make about this plot, do you notice any patterns? What happens if you provide the adjusted p-values (adj.P.Val) instead of the -log10 transformed p-values? Does this make any differences to the shape of the data?
So far our volcano plot is very plain, we are now going to improve it step by step, using other features of ggplot2. We can keep adding new layers by using the ‘+’ sign at the end of a line (never at the beginning). It is recommended to move onto a new line after each ‘+’ to make your code more readable.
Before we proceed, it is worth stating that you do not need to manually specify data =, x= and y= each time you use ggplot2. As long as these arguments are in the same order ggplot2 will automatically recognise them.
# So this
p <- ggplot(data = activeTB_vs_control, aes(x=logFC, y=log10_p.adjust)) +
geom_point()
p
# Can become this
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) +
geom_point()
pImprove information
Add significance thresholds
Currently, it is difficult to tell which genes are statistically significant looking at our plot. We are going to add dashed lines to our x- and y- axes to show which genes meet thresholds for significance.
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) + # Data and mapping/aesthetics
geom_point() + # Geometries, points
geom_hline(yintercept = -log10(0.05), linetype = "dashed") + # Geometry, horizontal line
geom_vline(xintercept = c(-1, 1), linetype = "dashed") # Geometry, vertical line
pgeom_hline adds a horizontal line to our plot. We tell it to intercept the y-axis at -log10(0.05). This is because our threshold for significance is p-adjust < 0.05, and we need to -log10 transform this value to match the transformation we made to the data.
geom_vline adds vertical line(s) to our plot. Here you can see we told it to produce multiple lines by providing a vector of values c(-1, 1). You could add additional lines by including additional numbers inside the vector - try this yourself! Here -1 and 1 log2FC would correspond to the gene expression either halving (down-regulated) or doubling (up-regulated) relative to the control. From our new volcano plot we can see more genes are up-regulated in the active TB group compared to the control.
Add gene labels
It would be useful if we could see specific gene names on this plot. This will be the next layer we add. Here we will make use of the other library we loaded earlier, ggrepel. This allows us to use the geometry geom_text_repel, which is an improvement on ggplot2's native geom_label function. Essentially, this stops the text labels overlapping each other. In the code below we also set a ‘seed’, which essentially means that our labels will be in the same place every time, rather than positioned randomly - this helps with reproducibility with our figures.
Let’s label the top 10 most significant genes using geom_text_repel().
# Get top 10 most significant genes
top_genes <- activeTB_vs_control %>%
rownames_to_column(var = "Genes") %>% # Add a new column to the dataframe containing the gene names
arrange(adj.P.Val) %>% # Arrange in decending order of adjusted p-value
head(10) # Take the top 10
# Set seed to ensure positioning of labels is the same every time
set.seed(42)
# Add gene labels to the plot
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) +
geom_point() +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0)
pWe now have a volcano plot with significance thresholds and the most significant genes labelled. Can you change the significance thresholds on your plot and increase/decrease the number of gene names displayed?
Improve aesthetics
Our plot is now a lot more informative but it is still quite ugly! In this next section, we’ll improve the visual appearance by adjusting the axis labels, theme, and colours.
Axis-labels
p <- p + xlab(expression("log2FC")) +
ylab(expression("-log10(FDR)"))
pWe can improve these even further by using mathematical notation. To do so, we make use of the expression function. The square brackets indicate that the number “2” should be written as a subscript. Everything in quotes is plain text. The “*” removes spacing between the elements.
p <- p + xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"FDR"))
pTheme
Next, we’ll add a theme to our plot. In general, the default ggplot2 theme is instantly recognizable and can make plots look unfinished/amateurish. Customizing the theme is one of the quickest and easiest ways to instantly improve a plot!
Below we apply ggplot2's ‘classic’ theme.
p <- p + theme_classic()
pColour points by significance
Now, we’ll add some visual interest and extra readability through use of some colour.
To do this, we’re going to add an additional column to our original data, stating whether a gene is significantly up-regulated or down-regulated. You don’t need to understand how the code works below but to summarise, we are making use of other functions from other tidyverse packages.
We take our original dataframe and pass it through a series of processing steps using the pipe function, %>%. The first step is to convert the row names to a new column called “Genes”. The second step is to add a new column using the mutate function by evaluating certain conditions. Here we create a new column called “Expression” which says a gene is “Up-regulated” when it has a logFC >=1 and adjusted p-value <= 0.05, a gene is “Down-regulated” when it has a logFC <= -1 and adjusted p-value <= 0.05, and “Unchanged” if it doesn’t meet either of those conditions.
We save all these changes to a new dataframe called “activeTB_vs_control_modified”.
activeTB_vs_control_modified <- activeTB_vs_control %>% # Pipe the dataframe through to the next function
rownames_to_column(var = "Genes") %>% # Add the row names as a new column called gene
mutate(
Expression = case_when(logFC >= 1 & adj.P.Val <= 0.05 ~ "Up-regulated",
logFC <= -1 & adj.P.Val <= 0.05 ~ "Down-regulated",
TRUE ~ "Unchanged") # Add a new column by evaluating certain conditions
)
head(activeTB_vs_control_modified) Genes logFC AveExpr t P.Value adj.P.Val B
1 FCGR1A 3.301767 6.149391 12.02848 1.223836e-21 2.176714e-17 38.65068
2 SERPING1 3.825079 6.433310 11.62863 9.709342e-21 8.634518e-17 36.60604
3 UBE2L6 1.801724 8.197470 11.54765 1.478635e-20 8.766333e-17 36.21297
4 GBP4 2.025176 6.142948 11.33775 4.406000e-20 1.656056e-16 35.14543
5 GBP5 2.653025 8.099339 11.32718 4.655505e-20 1.656056e-16 35.09147
6 WARS 1.992432 8.349681 11.19540 9.250438e-20 2.742138e-16 34.41573
log10_p.adjust Expression
1 16.66220 Up-regulated
2 16.06376 Up-regulated
3 16.05718 Up-regulated
4 15.78092 Up-regulated
5 15.78092 Up-regulated
6 15.56191 Up-regulated
We can add this new data to our ggplot2 object and then add a new layer with the points coloured by “Expression”.
# Add the new data to the ggplot2 object using '%+%'
p <- p %+% activeTB_vs_control_modified
# Add a new layer to the plot, a geometry, the points coloured by "Expression"
p <- p + geom_point(aes(color = Expression))
pThe colours used above are the default colours built into ggplot2. We can set our own custom colours by using R’s inbuilt colours or—if you’d prefer to pick your own—by using 6-digit hex codes. R’s named colours are below, let’s use some of those for now.
p <- p + scale_color_manual(values = c("dodgerblue3", "black", "firebrick3"))
pCan you see any issues with this plot and the previous one…?
To overcome this issue, we’ll create our plot again from the beginning, this time correctly ordering all the geometries.
p <- ggplot(activeTB_vs_control_modified, aes(logFC, log10_p.adjust)) + # New data
geom_point(aes(color = Expression)) + # Geometries, points
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0) +
scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"FDR")) +
theme_classic()
pTitle
Now our plot is looking good! Finally, we’ll add a title.
p <- p + ggtitle("Volcano plot - active TB vs control")
pSave it
After creating our volcano plot, we likely want to save it to our working directory. To do so we can use ggplot2’s inbuilt function ggsave. An example is below:
# Save plot to PDF
ggsave("volcano_plot_active_TB_vs_control.pdf", # File name we want to save our plot as
p, # The ggplot2 plot object
width = 5, # Width of plot
height = 4) # Height of plotYou’ll note I saved this file as a .pdf instead of a .png (or other image type). This is because plots saved as .pdfs are vectorised, meaning you can scale them as large or small as you like and they won’t get blurry. It also means we can import them into software such as InkScape or Powerpoint to edit them manually if needed.
Altogether
We can bring all this code together and run it in one code chunk. I’ve made a few additional modifications to the code so that gene names for up- and down-regulated genes are both displayed. This is done during the second step where we determine the ‘top_genes’.
# Add columns for gene names and expression
activeTB_vs_control_modified <- activeTB_vs_control %>%
rownames_to_column(var = "Genes") %>%
mutate(
Expression = case_when(logFC >= 1 & adj.P.Val <= 0.05 ~ "Up-regulated",
logFC <= -1 & adj.P.Val <= 0.05 ~ "Down-regulated",
TRUE ~ "Unchanged")
)
# Determine top differentially expressed genes
top_genes <- bind_rows(
activeTB_vs_control_modified %>%
filter(Expression == 'Up-regulated') %>%
arrange(adj.P.Val, desc(abs(logFC))) %>%
head(5),
activeTB_vs_control_modified %>%
filter(Expression == 'Down-regulated') %>%
arrange(adj.P.Val, desc(abs(logFC))) %>%
head(5)
)
# Produce plot
p <- ggplot(activeTB_vs_control_modified, aes(logFC, log10_p.adjust)) + # Use the modified data
geom_point(aes(color = Expression)) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0) +
scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"FDR")) +
theme_classic() +
ggtitle("Volcano plot - active TB vs control")
# Save plot
ggsave("volcano_plot_active_TB_vs_control.pdf",
p,
width = 5,
height = 4) The very final thing is to give our volcano plot a figure legend. It is important that the figure legend stands alone and contains all the information you need to interpret it. Below is our volcano plot, along with a figure legend typical of what you might see in a published paper.
So bringing this together, what does it all mean? To summarise, our volcano plot demonstrates that most genes are not differentially expressed, as indicated by the black points below the horizontal line and between the two vertical lines. This is typical of most -omics experiments and is what gives rise to the characteristic volcano shape. However, the plot also shows a substantial number of genes that are differentially expressed. Among these, many are highly significant, with adjusted p-values less than 0.01 or even 0.001. Overall, there are more up-regulated genes with large log2FC values. In a scientific paper, the next step would be to characterise these genes and put them in context, for example, by exploring whether genes from the same pathways are co-regulated. This figure is just the beginning of telling a story…
You now have the code to create and the know-how to interpret volcano plots of your own! Below are some exercises to put what you’ve just learned to use.
Box plots and violin plots
Our volcano plots gave us an idea of which genes in our dataset might be significantly different between the active TB and control samples. We might next want to visualise the distribution of the samples in the two groups for our genes of interest. This is is a perfect use for a box plot (Krzywinski and Altman 2014).
For this exercise we will have process our data for it to be suitable for ggplot2.
First, we’ll read the data in.
sample_data <- read.csv("data/normalised_TB_data.csv", row.names=1)
metadata <- read.csv("data/TB_metadata.csv", row.names=1)As previously we’ll do some quick inspections of the data.
dim(sample_data)[1] 17786 175
dim(metadata)[1] 175 4
sample_data[1:6, 1:4] Active.TB TB.progressor TB.progressor.1 TB.progressor.2
TSPAN6 -2.680353 -0.4008798 -1.019455 -1.80882670
DPM1 3.809980 3.4649190 3.837851 4.01469200
SCYL3 3.925248 3.8028715 3.645232 4.81295776
C1orf112 1.844374 1.3013206 1.423615 2.15749010
FGR 10.457478 9.5603012 9.667233 10.59497830
CFH 1.619406 0.9435486 0.976515 -0.05105541
head(metadata) Group Age Ethnicity Gender
Active TB Active TB 29 South Asia Male
TB progressor TB progressor 23 South Asia Female
TB progressor.1 TB progressor 23 South Asia Female
TB progressor.2 TB progressor 23 South Asia Female
TB progressor.3 TB progressor 23 South Asia Female
Active TB.1 Active TB 25 South Asia Male
We can see there are 17,786 rows and 175 columns in the data for our samples, corresponding to the 17,786 genes and 175 samples. The metadata has 175 rows (the samples) and 4 columns. For the metadata the one we are interested in is Group, telling us which TB group the patients belong to.
Single gene
Inspection of the data and our volcano plot shows that ‘FCGR1A’ is the most significant differentially expressed gene in the active TB vs control comparison. We’ll now create box plots to show the distribution of gene expression in the different groups.
Create data frame
We’ll create a dataframe of FCGR1A expression by combining the FCGR1A values from the sample data with the group information from our meta data.
FCGR1A_data <- data.frame(t(sample_data["FCGR1A",]), metadata$Group)
colnames(FCGR1A_data) <- c("Expression", "Group")Produce box plot
To start, we’ll produce a simple box plot. Instead of using geom_point(), we use the geom_boxplot() function to indicate that we are using box plots to represent the data.
p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
geom_boxplot()
pThough this plot is simple, it tells us several things. Perhaps the most obvious is the line inside the boxes, which represents the median (50th percentile), showing the central value of the data in each group. The box itself represents the inter-quartile range, with the top of the boxes representing the 75th percentile and the bottom of the boxes representing the 25th percentile. The vertical lines—or whiskers, as they’re mostly commonly known—show the range of the data, typically excluding outliers. The points outside the whiskers are outliers.
In the context of our data, we can see the Active TB group has the highest median value for the FCGR1A gene, and the Control group has the lowest median value. The whiskers show that the TB progressor group has the widest range of data, we can also see that the Latent TB group has an outlier value.
Now, we are going to improve the appearance of our box plot and include additional information.
Fill the boxes
If we want to fill the inside of our box plots instead of changing the outline we need to use the ‘fill’ argument in our aes() call instead, and use the scale_fill_manual() function to set the colours. It is optional which you’d prefer to colour—if any!—and depends on your preferences.
p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
geom_boxplot(aes(fill=Group)) +
scale_fill_manual(values=c("#455054", "#308695", "#D45769", "#E69D45")) +
xlab("") +
ylab(expression("Log"[2]*" FCGR1A expression")) +
theme_classic()
pSometimes it can be hard to manually chose our own colour palette, especially if you’re new to colour theory and don’t know which colours complement eachother. Luckily there are premade colour palettes in R (as part of the package RColorBrewer) that are available to you in ggplot2. To use a one of these premade colour palettes we use the scale_fill_brewer() function instead of scale_fill_manual(). Inside the function call you need to specify the name of the palette you want to use, in this case we’ll use “Dark2” but some other example palettes are below:
p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
geom_boxplot(aes(fill=Group)) +
scale_fill_brewer(palette="Dark2") + # Use the 'Dark2' premade colour palette
xlab("") +
ylab(expression("Log"[2]*" FCGR1A expression")) +
theme_classic()
pViolin plot
Violin plots are closely related to box plots and often share the same purpose, however they can add additional useful information such as the distribution of the sample data (density trace). This is demonstrated in the figure below.
We will produce a basic violin plot below:
p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
geom_violin()
pBox and violin
It is more typical for violin plots to have either the individual data points displayed on top or a box plot. To produce such a visualisation, we need to layer our ggplot2 object correctly. The code is below, though has been hidden to allow you to attempt the above exercise first.
Code
p <- ggplot(FCGR1A_data, aes(Group, Expression, fill=Group)) +
geom_violin(color=NA, alpha=0.3, width=0.6) +
geom_boxplot(width=0.2) +
scale_fill_manual(values=c("#455054", "#308695", "#D45769", "#E69D45")) +
xlab("") +
ylab(expression("Log"[2]*" FCGR1A expression")) +
theme_classic()
pImprove it further!
At this point our visualisation looks quite nice, but there are a few additional steps we can take to make it publication ready.
Reorder groups
Intentional colours
Hide redundant legend
To reorder the groups in our plot, we need to convert the group column to an ordered factor. This tells R there is an order to our categorical data.
FCGR1A_data$Group <- factor(FCGR1A_data$Group,
ordered = TRUE,
levels = c("Control", "Latent TB", "TB progressor", "Active TB"))Now, let’s produce the plot again.
p <- ggplot(FCGR1A_data, aes(Group, Expression, fill=Group)) +
geom_violin(color=NA, alpha=0.3, width=0.6) +
geom_boxplot(width=0.2) +
scale_fill_manual(values=c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +
xlab("") +
ylab(expression("Log"[2]*" FCGR1A expression")) +
theme_classic() +
theme(legend.position = "none") # Remove redundant legend
pWe have reordered the variables so the ‘Control’ group is shown first, then the other groups in order of severity. Similarly, we have also employed a colour-gradient along this axis of severity. Finally, we have removed the redundant legend on the right by modifying the theme, as the grouping information is already included in the x-axis.
Multiple genes
Now that we’ve learned how to plot a single gene using box and violin plots, let’s move on to visualising multiple genes at once. For this exercise, we’ll use facet_wrap() to display multiple genes in a grid of plots. This allows us to see the expression of several genes of interest across the different groups.
Create a data frame for multiple genes
To get started, we need to extract the expression data for our genes of interest from the sample data and combine it with the group information from the metadata. We’ll do this for the genes: "FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", and "RP5-1184F4.7".
Then, we’ll need to reshape our data into a ‘tidy’ format so that all gene expression data is in a single column, with an additional column to indicate the gene name(Wickham 2016). This will prepare the data for ggplot2 to handle multiple facets.
Data is considered ‘tidy’ when:
Each variable forms a column
Each observation forms a row
Each observational unit forms a table
Let’s first create a dataframe of the gene expression data that we want to convert to a ‘tidy’ format.
# Genes of interest
genes_of_interest <- c("FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", "RP5-1184F4.7")
# Extract the expression data for the genes of interest
multi_gene_data <- data.frame(t(sample_data[genes_of_interest, ]))
# Add group information
multi_gene_data$Group <- metadata$Group
# Inspect data structure
head(multi_gene_data) FCGR1A SERPING1 UBE2L6 GBP4 MAGEF1 RP5.1184F4.7
Active.TB 9.094789 10.028439 10.001784 7.813265 1.992708 -3.4549427
TB.progressor 3.375139 4.706795 7.225487 5.077932 2.850249 0.9744491
TB.progressor.1 5.639912 6.281472 7.883710 5.780228 2.655709 0.8747338
TB.progressor.2 9.538771 10.091418 10.475483 8.616317 1.900342 0.4368986
TB.progressor.3 6.524266 7.947446 8.976370 6.406779 2.474013 0.7824810
Active.TB.1 7.924561 7.958690 9.307978 7.194368 1.859675 -1.3883841
Group
Active.TB Active TB
TB.progressor TB progressor
TB.progressor.1 TB progressor
TB.progressor.2 TB progressor
TB.progressor.3 TB progressor
Active.TB.1 Active TB
Initially, each row represents a sample, and the “Group” column indicates which group (Control, Latent TB, etc.) that sample belongs to.
Now, we will convert the data to a ‘tidy’ format using tidyr‘s pivot_longer function. This ’lengthens’ the data, increasing the number of rows and decreasing the number of columns.
When we use the function below:
-Grouptells R not to pivot the “Group” column - i.e. the grouping information stays as it isThe transformation is applied to all other columns
names_to = "Gene"specifies that the column names should be gathered into a new column called “Gene”.values_to = "Expressionmeans that the original gene expression columns are gathered into a new column called “Expression”
# Reshape data into tidy/long format for ggplot2
multi_gene_data <- multi_gene_data %>%
pivot_longer(-Group, names_to = "Gene", values_to = "Expression")
# Convert group data to ordered factor
multi_gene_data$Group <- factor(multi_gene_data$Group,
ordered = TRUE,
levels = c("Control", "Latent TB", "TB progressor", "Active TB"))
head(multi_gene_data)# A tibble: 6 × 3
Group Gene Expression
<ord> <chr> <dbl>
1 Active TB FCGR1A 9.09
2 Active TB SERPING1 10.0
3 Active TB UBE2L6 10.0
4 Active TB GBP4 7.81
5 Active TB MAGEF1 1.99
6 Active TB RP5.1184F4.7 -3.45
Comparing this dataframe to the one we first produced we can see that it now only has three columns: “Group”, “Gene”, and “Expression”, compared to our earlier dataframe that had separate columns for genes. This means our dataframe is now tidy, as all the information needed to facet the plot is contained in the “Gene” column.
Facet-wrapped plots
Now that our data is in the correct format, we can create a violin plot with box plots laid over the top for each gene. We’ll use facet_wrap() to create individual plots for each gene.
# Create the plot with facet_wrap
p <- ggplot(multi_gene_data, aes(Group, Expression, fill = Group)) +
geom_violin(color = NA, alpha = 0.3, width = 0.6) +
geom_boxplot(width = 0.2) +
facet_wrap(~ Gene, scales = "free_y") + # Create a separate plot for each gene
scale_fill_manual(values = c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +
xlab("") +
ylab(expression("Log"[2]*" expression")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") # Rotate x-axis labels for readability, remove redundant legend
pIn this plot:
We use
facet_wrap(~ Gene)to create separate plots for each gene of interest.The
scales = "free_y"option ensures that each gene has its own y-axis scale, which is important when different genes have different expression ranges.We’ve used
geom_violin()function andgeom_boxplot()functions as previously.We’ve rotated the x-axis labels to make the plot more readable.
Faceting in ggplot2 is a powerful tool that enables you to create complex plots efficiently. By facet-wrapping multiple genes, we can quickly visualise the expression patterns across the groups for several genes at the same time (for example, groups of genes in the same pathway). This approach also offers flexibility for customising the appearance and layout of your plots, making it straightforward to produce publication-ready visuals with minimal effort.
Concluding thoughts
At first, ggplot2 can feel overwhelming, especially if you aren’t used to coding. While the plots it produces are beautiful, you might be tempted to revert to familiar tools like Excel or GraphPad for simplicity.
However, imagine a situation where you were comparing 10 or 20 different conditions, where you would need to generate new plots for each condition manually. This can lead to mistakes, inconsistencies in labelling or sizing, and inefficiency. With ggplot2, you can reuse the same code to produce consistent and reproducible plots. As you become more comfortable with R, you can automate your workflow, using loops and functions to generate dozens or even hundreds of plots in a matter of seconds—saving time and ensuring accuracy.
This workshop has provided you with two worked examples of the types of plots you can produce using ggplot2, however, your research projects will be extremely varied and will have their own unique plotting requirements. There are a wealth of free materials, tutorials and videos online and many community forums you can turn to if you get stuck.
Extra/supplementary materials
Plotting outside of ggplot2 - heatmaps!
In this workshop we have learned how to use ggplot2 and how versatile it can be for a number of applications. However, there are a few types of visualisation where other, more specialised, plotting tools can be preferable, for example, heatmaps.
Heatmaps are another commonly used visualisation used to represent -omics data, particularly when dealing with multiple genes/proteins/etc and samples. Heatmaps are often used in bioinformatics to show the expression levels of genes across different samples in a matrix format, where colours represent the expression values. In the following supplementary exercise we are going to compare producing a heatmap with ggplot2 vs an alternative plotting tool: ComplexHeatmap, which offers more functionality for heatmaps, such as sample annotation and hierarchical clustering.
You may first need to install ComplexHeatmap and another supporting package, cowplot, for this portion of the workshop. To do so, run the code below:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("ComplexHeatmap", "cowplot"))Heatmap using ggplot2
First, we will create a basic heatmap using ggplot2. Heatmaps in ggplot2 are created using the geom_tile() function, which allows us to represent each cell of the matrix as a coloured rectangle.
Create Data for Heatmap
We will use the same genes of interest that we explored with our box plots and violin plots. For this heatmap, we’ll extract the expression data for these genes, and then as before we’ll set the Group column as a factor to indicate the ordering of the TB groups. We’ll also include a new additional column for the sample names called ‘Sample_ID’. When we use pivot_longer() both ‘Sample_ID’ and ‘Group’ will be used as ID variables.
Another important step we take is to z-score scale the data for each gene.
A z-score tells us the number of standard deviations a value is from the mean of a given distribution. Negative z-scores indicate the value lies below the mean. Positive z-scores indicate the value lies above the mean.
We want our data to be z-score scaled for each row of the heatmap, i.e. each gene. When we z-score scale the data for each gene we make it so that the mean of all of all the samples is 0 and the standard deviation is 1. This might sound confusing/unintuitive until we see it in action with our heatmaps below.
# Genes of interest
genes_of_interest <- c("FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", "RP5-1184F4.7")
# Extract expression data for the selected genes
heatmap_data <- data.frame(t(sample_data[genes_of_interest, ]))
# Add group information
heatmap_data$Group <- metadata$Group
# Add sample names to a new column
heatmap_data$Sample_ID <- rownames(heatmap_data)
# Reshape data into long format for ggplot2
heatmap_data_long <- heatmap_data %>%
pivot_longer(-c(Sample_ID, Group), names_to = "Gene", values_to = "Expression")
# Z-score scale the expression values for each gene across the samples
heatmap_data_long <- heatmap_data_long %>%
group_by(Gene) %>%
mutate(Expression = scale(Expression))
# Convert group data to ordered factor
heatmap_data_long$Group <- factor(heatmap_data_long$Group,
ordered = TRUE,
levels = c("Control", "Latent TB", "TB progressor", "Active TB"))
# Inspect data structure
head(heatmap_data_long)# A tibble: 6 × 4
# Groups: Gene [6]
Group Sample_ID Gene Expression[,1]
<ord> <chr> <chr> <dbl>
1 Active TB Active.TB FCGR1A 1.75
2 Active TB Active.TB SERPING1 1.80
3 Active TB Active.TB UBE2L6 1.83
4 Active TB Active.TB GBP4 1.54
5 Active TB Active.TB MAGEF1 -1.23
6 Active TB Active.TB RP5.1184F4.7 -2.71
Create a Heatmap with ggplot2
Next, we’ll create the heatmap using ggplot2. We’ll use the geom_tile() geometry to create the heatmap (the heatmap tiles). We’ll set the fill colour to represent the expression values and customise the plot further using themes and colour scales.
Unlike our previous worked examples, where we used discrete colour scales, here will will be using a continuous gradient colour scheme.
# Plot the heatmap
p <- ggplot(heatmap_data_long, aes(x = Sample_ID, y = Gene, fill = Expression)) +
geom_tile() + # Create heatmap cells
scale_fill_gradientn(colours = c("blue", "white", "red")) + # Set colour gradient for expression
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
xlab("Sample ID") +
ylab("Gene") +
ggtitle("Heatmap of Gene Expression")
pHere we have quite a basic heatmap. The rows are the genes, the columns are the individual samples, so each individual tile represents each combination of a gene and a sample. The tiles are coloured by the Z-score scaled expression values. This allows us to see patterns of expression for each gene, relative to eachother.
The columns are ordered so that the Active TB samples are first, then the Control samples, then the Latent TB samples and finally the TB progressor samples. This allows us to see some vague patterns in the data. We can see that FCGR1A, SERPING1, UBE2L6 and GBP4 levels tend to be higher in the Active TB and TB progressor groups vs the Control and Latent TB groups. Inversely, MAGEF1 and RP5-1184F4.7 appear to have lower expression levels in the Active TB group vs every other group.
Improve it
However, this heatmap has a few short-comings. As there are so many samples it is difficult to read the labels on the x-axis. ggplot2 doesn’t include native support for cluster dendograms so it is difficult to get an idea of the relationship between samples.
Finally, we lack any sort of ‘Group’ information/annotation. We can manually create a ‘Group’ visualisation that we can combine with our heatmap. The way this works is that we create a seperate heatmap for our ‘Group’ information. We set our x-axis as our Sample_ID and our y-axis as 1 to create a grid 1 tile high. Then we fill the tiles with our grouping information.
# Load supporting library for creating plot grid
library(cowplot)
# Generate annotation plot for group information
annotation_plot <- ggplot(heatmap_data_long, aes(x = Sample_ID, y = 1, fill = Group)) +
geom_tile() +
scale_fill_manual(values = c("Control" = "#308695",
"Latent TB" = "#a077a6",
"TB progressor" = "#ab547c",
"Active TB" = "#ba2b40")) +
theme_void() + # Remove axis and grid lines
theme(legend.position = "top") + # Place legend on top
ylab("Group") +
guides(fill = guide_legend(title = "Group")) +
ggtitle("Heatmap of Gene Expression\n") # Reuse title here so it appears on top of final plot
# Remove x-axis labels from the main heatmap now that we have group labelling
p <- p + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
ggtitle("")
# Combine annotation plot and heatmap using cowplot::plot_grid
combined_plot <- plot_grid(annotation_plot, p, ncol = 1, align = "v", rel_heights = c(0.3, 1))
# Display the combined plot
combined_plotThough this is better, it is quite a lot of code/effort to produce an overall sub par heatmap!
Heatmap using ComplexHeatmap
Now, let’s explore the ComplexHeatmap package, which simplifies the creation of heatmaps with enhanced features.
Load relevant packages
# Load necessary libraries
library(ComplexHeatmap)
library(circlize) # For colour scalingPrepare data for ComplexHeatmap
Next, we’ll prepare the data for the ComplexHeatmap. One difference between ggplot2 and ComplexHeatmap is that ggplot2 expects the data to be in a tidy/long format, whereas ComplexHeatmap expects the data to be a matrix, with the gene names as the row names and the sample names as the column names.
As with our ggplot2 example, we will be z-score scaling our row data (genes) so that the mean of all of the values is 0 and the standard deviation is 1.
# Extract and scale the expression data for genes of interest
heatmap_data <- sample_data[genes_of_interest, ]
heatmap_data_scaled <- t(apply(heatmap_data, 1, scale)) # Z-score scale per gene
colnames(heatmap_data_scaled) <- colnames(heatmap_data)
# Prepare group information for annotation
group_annotation <- metadata$GroupCreate a Basic Heatmap with ComplexHeatmap
We can now create a heatmap using the Heatmap() function from ComplexHeatmap. This function supports hierarchical clustering and allows us to add annotations and other customisation.
# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
# Create heatmap annotations for groups
ha <- HeatmapAnnotation(Group = group_annotation,
col = list(Group = c("Control" = "#308695", "Latent TB" = "#a077a6",
"TB progressor" = "#ab547c", "Active TB" = "#ba2b40")))
# Generate the heatmap
ht <- ComplexHeatmap::Heatmap(heatmap_data_scaled,
name = "Expression",
col = col_fun,
top_annotation = ha,
show_row_names = TRUE,
show_column_names = FALSE,
cluster_rows = TRUE,
cluster_columns = TRUE)To explain briefly what the code above is doing:
First we create a colour gradient function for our heatmap using
colorRamp2and assign it tocol_funSecondly, we create an annotation object (
ha) for our heatmap, indicating the different groups (e.g., Control, Latent TB) of the samples. Each group is assigned a specific colour.The final part generates the heatmap (
ht) itself using theHeatmapfunction.heatmap_data_scaled: The scaled expression data to be visualised.name: Sets the title for the colour scale (in this case, “Expression”).col: Applies the colour scale function created earlier (col_fun).top_annotation: Adds the group annotations (ha) at the top of the heatmap.show_row_names: Displays the names of the genes on the y-axis.show_column_names: Hides the sample names on the x-axis.cluster_rowsandcluster_columns: Carry out hierarchical clustering on the row and column data.
Now, let’s visualise our heatmap.
# Draw the heatmap
draw(ht)This heatmap contains a lot more information compared to the one we produced with ggplot2!
From the row dendograms, we can see that MAGEF1 and RP5-1184F4.7 are both downregulated in Active TB and cluster together. The genes that are upregulated in Active TB also cluster together, with SERPING1 and UBE2L6 being most similar to eachother.
From the column dendograms we can see that there are three main clusters. The majority of the Active TB samples cluster together, with one group forming a large cluster on the left. The next cluster of samples is more mixed, though contains the majority of our TB progressor samples. This fits with the information we’d previously gained from our violin plots and box plots, showing the broad distribution of expression within the TB progressor samples. The final cluster on the right contains the majority of the control and latent TB samples - it is unsurprising they cluster together, as our violin plots showed the similar patterns of expressions for the selected genes within these two groups. By carrying out hierarchical clustering we are able to display these more sophisticated relationships between the samples.
Extra Exercises
Adding extra detail to our heatmap
Finally, one of the thing that really sets ComplexHeatmap apart is the level of annotation we can add to our heatmaps. We will demonstrate this below by adding ‘age’ and ‘gender’ annotation from our metadata table.
# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
# Create a color scale for continuous 'Age'
age_col_fun <- colorRamp2(c(min(metadata$Age), max(metadata$Age)), c("lightyellow", "darkorange"))
# Create a color scale for categorical 'Group', 'Ethnicity', and 'Gender'
col_fun_group <- list(Group = c("Control" = "#308695",
"Latent TB" = "#a077a6",
"TB progressor" = "#ab547c",
"Active TB" = "#ba2b40"))
col_fun_gender <- list(Gender = c("Male" = "#1F78B4", "Female" = "#FB9A99"))
# Create heatmap annotations for groups, ethnicity, gender, and age
ha <- HeatmapAnnotation(Group = metadata$Group,
Gender = metadata$Gender,
Age = metadata$Age,
col = c(col_fun_group, col_fun_gender,
Age = age_col_fun))
# Generate the heatmap with added annotations
ht_extra_anno <- ComplexHeatmap::Heatmap(heatmap_data_scaled,
name = "Expression",
col = col_fun,
top_annotation = ha,
show_row_names = FALSE, # row names are hidden due to number of genes
show_column_names = FALSE,
cluster_rows = TRUE,
cluster_columns = TRUE)
# Draw the heatmap
draw(ht_extra_anno)We can see that the samples still cluster mostly by ‘Group’, but there appears to be some mild clustering by ‘Gender’. ‘Age’ doesn’t appear to have much effect on the clustering.
Advantages of ComplexHeatmap
ComplexHeatmap offers several advantages over ggplot2 for heatmap generation:
Hierarchical Clustering: By default,
ComplexHeatmapperforms clustering of both rows (genes) and columns (samples). This is crucial for identifying patterns in the data.Annotations: It is easy to add sample annotations, such as group, gender, ethnicity, etc.
Dendrograms: The package includes dendrograms to show relationships between samples and genes based on their expression patterns.
Comparison: ggplot2 vs. ComplexHeatmap
While both ggplot2 and ComplexHeatmap are powerful, each has its strengths for specific tasks:
| Feature | ggplot2 | ComplexHeatmap |
|---|---|---|
| Ease of use | Extremely flexible but can require a lot of manual setup for certain types of plot | Mostly specialised for heatmaps, easy to produce complex heatmaps with multiple levels of annotation |
| Clustering | Requires manual calculation and ordering | Automatic hierarchical clustering |
| Sample annotations | Limited, manual setup needed, can require the usage of additional supporting packages such as patchwork or cowplot |
Simple and intuitive with HeatmapAnnotation() |
| Dendrograms | Not supported | Supported and automatic |
| Customisability | Extremely flexible for all plot types | Specialised for heatmaps |
In summary, ggplot2 is a general-purpose plotting package and can handle a wide variety of visualisations. In contrast, ComplexHeatmap excels in producing heatmaps with features commonly needed in bioinformatics analysis, such as clustering and sample annotations (especially in the case where many groupings are present).
Overall, this practical has aimed to showcase ggplot2 and its capabilities, but also it’s occasional limitations. Sometimes it is better to use a tool that is already specialised for your needs, rather than attempting to code something from scratch. Understanding the strengths of specialised tools is essential in making informed choices for visualising data effectively in your analyses.
Further reading/resources
- https://rpubs.com/odenipinedo/introduction-to-data-visualization-with-ggplot2 - Additional worked examples of
ggplot2. - https://r-graph-gallery.com/ - The R Graph Gallery, contains examples of nearly every type of plot you could hope to make and example code to do so
- https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833 ‘Ten Simple Rules for Better Figures’ - useful article for improving figures
- http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html ‘Top 50 ggplot2 Visualizations - The Master List’ - other examples of plots that can be made in
ggplot2 - https://www.youtube.com/watch?v=9YTNYT1maa4 Hadley Wickham presentation on
ggplot2 - https://github.com/erikgahner/awesome-ggplot2 - GitHub page containing extensive
ggplot2resources including tutorials, books, expansions, online courses, themes and colour palettes - http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html - a very detailed scatter plot tutorial including line of best fit and using premade
ColorBrewerpalettes - http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/ - extremely detailed
ggplotcheat sheet